Flexible Computer Assisted Transcription of Historical Documents Through Subword Spotting

Authors

  • Brian Davis
  • Robert Clawson
  • William Barrett
Abstract

In the absence of accurate handwriting recognition for historical documents, computer assisted transcription (CAT) methods move into the spotlight. We explore some of the weaknesses of current CAT systems and propose a CAT system, based on subword spotting, that overcomes most of them. The system is well suited to crowdsourcing transcription to mobile users.

Similar resources

Comparing decoding strategies for subword-based keyword spotting in low-resourced languages

For languages with limited training resources, out-of-vocabulary (OOV) words are a significant problem, both for transcription and keyword spotting. This paper investigates the use of subword lexical units for keyword spotting. Three strategies for using the subword units are explored: 1) converting word-based lattices to subword lattices after decoding, 2) performing a separate decoding for ea...


Exploiting Collection Level for Improving Assisted Handwritten Words Transcription of Historical Documents

Transcription of handwritten words in historical documents is still a difficult task. When processing a huge number of pages, document-centered approaches are limited by the trade-off between automatic recognition errors and the tedium of human annotation work. In this article, we investigate the use of inter-page dependencies to overcome those limitations. For this, we propose a new...


HADARA – A Software System for Semi-Automatic Processing of Historical Handwritten Arabic Documents

Recently, many large libraries all over the world have been scanning their collections to make them publicly available and to preserve historical documents. We present a modular software system which can be used as a tool for semi-automatic processing of historical handwritten Arabic documents. The development of this system is part of the HADARA project which aims for historical document analy...


Radial Line Fourier Descriptor for Segmentation-free Handwritten Word Spotting

Automatic recognition of historical handwritten manuscripts is a daunting task due to paper degradation over time. Recognition-free retrieval, or word spotting, is popularly used for information retrieval and digitization of historical handwritten documents. However, the performance of word spotting algorithms depends heavily on feature detection and representation methods. Although there exi...


Enhancing low resource keyword spotting with automatically retrieved web documents

Keyword Spotting (KWS) systems developed for low resource languages with very little transcribed audio suffer due to a small vocabulary (high out-of-vocabulary (OOV) rate) and a weak language model. In this paper, we propose to augment such systems using automatically retrieved web documents. Our procedure can find large volumes of web documents similar to a small pool of training transcription...


Publication date: 2016